The data set I worked with contains box scores from the 2021-2022 NBA season. The data was retrieved from NBA.com and has 24 columns and 2461 rows. Each row is one team’s performance in a given game, so each game has two rows, one for each team.
https://docs.google.com/spreadsheets/d/1FcJVyAggt7qLm3J-gh1_cReJeQn_8jOQd77BaDY3crE/edit?usp=sharing

| Column | Data Type | Description |
|---|---|---|
| Team | String | Team whose data is shown |
| Matchup | String | Teams playing in the game |
| Game Date | String | Date of the game |
| Win or Loss | Boolean | If the team won or lost |
| MIN | Double | Length of game play (minutes) |
| PTS | Double | Points scored |
| FGM | Double | Field goals made |
| FGA | Double | Field goals attempted |
| FGP | Double | Field goal percentage |
| 3PM | Double | 3 pointers made |
| 3PA | Double | 3 pointers attempted |
| 3PP | Double | 3 point percentage |
| FTM | Double | Free throws made |
| FTA | Double | Free throws attempted |
| FTP | Double | Free throw percentage |
| OREB | Double | Offensive rebounds |
| DREB | Double | Defensive rebounds |
| REB | Double | Total rebounds |
| AST | Double | Assists |
| STL | Double | Steals |
| BLK | Double | Blocks |
| TOV | Double | Turnovers |
| PF | Double | Personal fouls |
| Plus Minus | Double | Team Plus-Minus |

| Team | Matchup | Game Date | Win or Loss | MIN | PTS | FGM | FGA | FGP | 3PM | 3PA | 3PP | FTM | FTA | FTP | OREB | DREB | REB | AST | STL | BLK | TOV | PF | Plus Minus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SAS | SAS @ DAL | 2022-04-10 | L | 240 | 120 | 43 | 89 | 48.3 | 11 | 31 | 35.5 | 23 | 23 | 100.0 | 7 | 28 | 35 | 26 | 15 | 3 | 8 | 17 | -10 |
| BOS | BOS @ MEM | 2022-04-10 | W | 240 | 139 | 54 | 99 | 54.5 | 18 | 48 | 37.5 | 13 | 13 | 100.0 | 14 | 42 | 56 | 34 | 5 | 2 | 15 | 20 | 29 |
| IND | IND @ BKN | 2022-04-10 | L | 240 | 126 | 47 | 104 | 45.2 | 19 | 46 | 41.3 | 13 | 19 | 68.4 | 11 | 19 | 30 | 32 | 16 | 1 | 7 | 23 | -8 |
| MEM | MEM vs. BOS | 2022-04-10 | L | 240 | 110 | 39 | 102 | 38.2 | 15 | 47 | 31.9 | 17 | 27 | 63.0 | 19 | 26 | 45 | 27 | 11 | 6 | 10 | 16 | -29 |
| DEN | DEN vs. LAL | 2022-04-10 | L | 265 | 141 | 49 | 100 | 49.0 | 15 | 47 | 31.9 | 28 | 36 | 77.8 | 11 | 34 | 45 | 33 | 10 | 7 | 14 | 34 | -5 |
| LAL | LAL @ DEN | 2022-04-10 | W | 265 | 146 | 44 | 94 | 46.8 | 16 | 43 | 37.2 | 42 | 47 | 89.4 | 13 | 37 | 50 | 26 | 4 | 6 | 13 | 24 | 5 |
| HOU | HOU vs. ATL | 2022-04-10 | L | 240 | 114 | 41 | 89 | 46.1 | 17 | 46 | 37.0 | 15 | 20 | 75.0 | 6 | 28 | 34 | 24 | 4 | 4 | 8 | 19 | -16 |
| SAC | SAC @ PHX | 2022-04-10 | W | 240 | 116 | 40 | 76 | 52.6 | 14 | 26 | 53.8 | 22 | 30 | 73.3 | 2 | 38 | 40 | 26 | 9 | 7 | 15 | 18 | 7 |
| UTA | UTA @ POR | 2022-04-10 | W | 240 | 111 | 37 | 82 | 45.1 | 9 | 36 | 25.0 | 28 | 38 | 73.7 | 15 | 45 | 60 | 23 | 8 | 6 | 17 | 16 | 31 |
| POR | POR vs. UTA | 2022-04-10 | L | 240 | 80 | 31 | 83 | 37.3 | 9 | 34 | 26.5 | 9 | 12 | 75.0 | 5 | 27 | 32 | 21 | 11 | 8 | 16 | 27 | -31 |
| PHX | PHX vs. SAC | 2022-04-10 | L | 240 | 109 | 42 | 103 | 40.8 | 14 | 47 | 29.8 | 11 | 15 | 73.3 | 18 | 32 | 50 | 27 | 9 | 7 | 11 | 25 | -7 |
| CLE | CLE vs. MIL | 2022-04-10 | W | 240 | 133 | 51 | 94 | 54.3 | 19 | 38 | 50.0 | 12 | 17 | 70.6 | 10 | 38 | 48 | 39 | 5 | 5 | 12 | 26 | 18 |
| MIL | MIL @ CLE | 2022-04-10 | L | 240 | 115 | 39 | 88 | 44.3 | 12 | 30 | 40.0 | 25 | 32 | 78.1 | 8 | 33 | 41 | 27 | 7 | 2 | 12 | 14 | -18 |
I wanted to answer two questions: “Which variables have the most impact on a team’s probability of winning?” and “At what values of those variables would we predict a win rather than a loss?”
Logistic regression is well suited to predicting between two states, so I used it to predict whether a team won or lost a game. I used the same variables as in the Naive Bayes model (points, 3 point makes, defensive rebounds, steals, blocks, and turnovers) to predict wins and losses.
Shown in the plots on the right are logistic regression fits of wins against each of the four most significant variables (points, defensive rebounds, steals, and turnovers). The strongest predictors are points and defensive rebounds, as shown by the steep curves in their plots. At and above 110 points scored, teams are more likely to win than lose. Above 34 defensive rebounds, the predicted win probability is above 0.5. The slope of the logistic fit for steals is far more constant (closer to a straight line) than that of points and defensive rebounds, because steals have less impact on the outcome of the game than those two variables. Turnovers are similar to steals in that they have a smaller impact on the outcome than points or defensive rebounds; however, turnovers are the only fit with a negative slope. All four plots behave as expected and show which variables matter most to winning and what values teams should aim for.
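The "more likely to win than lose" thresholds quoted above correspond to the point where a single-variable logistic fit crosses a probability of 0.5. A minimal sketch of how such a crossover can be read off a fit, using simulated data rather than the real box scores (the toy data frame, the seed, and the generating coefficients are all assumptions made for illustration):

```r
# Simulated stand-in data; these are NOT the real NBA box scores.
set.seed(1)
pts <- round(rnorm(500, mean = 110, sd = 12))
# Assumed relationship, used only to generate toy win/loss outcomes.
win <- rbinom(500, size = 1, prob = plogis(-11 + 0.1 * pts))
toy <- data.frame(PTS = pts, WL = win)

# Single-predictor logistic regression, as in the per-variable plots.
fit <- glm(WL ~ PTS, data = toy, family = binomial)
b <- coef(fit)

# P(win) = 0.5 exactly where the linear predictor b0 + b1 * PTS equals zero.
crossover <- unname(-b["(Intercept)"] / b["PTS"])

# Sanity check: the fitted probability at the crossover is 0.5.
p_at_crossover <- unname(predict(fit, newdata = data.frame(PTS = crossover),
                                 type = "response"))
```

The same calculation applies to any of the four single-variable fits; for a negative slope such as turnovers, values below the crossover favor a win.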
For multiple linear regression, the question I was trying to answer is: what is the best model for predicting a team’s +/- (Plus/Minus) from its other box score stats? First I will use model selection to find the most precise model, and then I will analyze the results of that model and assess its accuracy when predicting the plus/minus of the game.
AIC
BICq equivalent for q in (0.857417450547111, 0.974008663992208)
Best Model:
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -42.08382733 | 3.34255085 | -12.590333 | 2.886426e-35 |
| pts | 0.78600753 | 0.02536539 | 30.987404 | 3.839835e-178 |
| FGA | -1.31887041 | 0.04118466 | -32.023341 | 3.095563e-188 |
| TPP | 0.08904543 | 0.02865367 | 3.107645 | 1.907504e-03 |
| FTM | -0.86288376 | 0.04107072 | -21.009704 | 3.108209e-90 |
| FTP | 0.19158907 | 0.01650551 | 11.607581 | 2.318358e-30 |
| OREB | -0.13141946 | 0.06092058 | -2.157226 | 3.108513e-02 |
| REB | 1.52644661 | 0.03203467 | 47.649834 | 0.000000e+00 |
| AST | 0.09238629 | 0.04507340 | 2.049686 | 4.050150e-02 |
| STL | 1.49313360 | 0.05967553 | 25.020867 | 3.203795e-123 |
| BLK | 0.37541120 | 0.06787868 | 5.530620 | 3.529889e-08 |
| TOV | -1.19741924 | 0.04960223 | -24.140434 | 1.136050e-115 |
| PF | 0.13026598 | 0.04175272 | 3.119940 | 1.829957e-03 |
Once again, I am attempting to use these NBA box score statistics to accurately predict the +/- of the game. The +/- (plus minus) is an integer equal to the total points a team scored minus the total points its opponent scored. It is negative if the team lost and positive if the team won. I had to remove FGM, FTA, and DREB because of multicollinearity between these and other variables. I used a best subsets approach for variable selection. Using AIC as my information criterion, I was left with a model that included more variables than I expected: +/- ~ PTS + FGA + 3P% + FTM + FT% + OREB + REB + AST + STL + BLK + TOV + PF. All of the variables in this model were statistically significant at the 0.05 level. The model with coefficients (rounded to four decimal places) is +/- = -42.0838 + 0.7860 * PTS - 1.3189 * FGA + 0.0890 * 3P% - 0.8629 * FTM + 0.1916 * FT% - 0.1314 * OREB + 1.5264 * REB + 0.0924 * AST + 1.4931 * STL + 0.3754 * BLK - 1.1974 * TOV + 0.1303 * PF. One interesting thing I noticed was that rebounds had the largest coefficient, which surprised me. I expected it to be points, but each coefficient is the change in +/- per one-unit increase, and since a team records far more points than rebounds in a game, a single rebound can carry a larger estimated coefficient than a single point. The adjusted R-squared of this model was 0.7313, meaning that 73.13% of the variability in plus minus is explained by this model. Now we can continue on to checking the assumptions for multiple linear regression. The coefficients are shown in the coefficient plot on the right hand side, along with a confidence interval for each variable’s coefficient.
We can see that rebounds and steals are the two most important variables in determining a team’s +/-. Intuitively this makes sense: when you rebound the ball, you secure a possession, whether the opposing team missed a shot or your team did. Likewise, when your team gets a steal, you gain a possession, which can explain the large coefficients found for both variables. One thing I found very surprising was that Field Goals Attempted (FGA) has the most negative estimated coefficient, at -1.32. This can be explained by the model penalizing teams that are just chucking up shots (volume over quality): points has a positive coefficient, so a shot attempt is penalized, but if the shot goes in, the points counteract the penalty. Finally, the second most negative coefficient is for turnovers (TOV), which are the opposite of steals: a turnover ensures that you lost a possession.
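To make the fitted equation concrete, here is a small worked example that plugs one row of the sample data (SAS @ DAL on 2022-04-10) into the rounded coefficients reported above. The vector names `coefs` and `sas` are illustrative only; `TPP` and `FTP` stand for 3P% and FT%, as in the model output.

```r
# Rounded coefficients copied from the fitted model reported in the text.
coefs <- c("(Intercept)" = -42.0838, PTS = 0.7860, FGA = -1.3189, TPP = 0.0890,
           FTM = -0.8629, FTP = 0.1916, OREB = -0.1314, REB = 1.5264,
           AST = 0.0924, STL = 1.4931, BLK = 0.3754, TOV = -1.1974, PF = 0.1303)

# Box score for SAS @ DAL (2022-04-10), taken from the sample data table.
sas <- c(PTS = 120, FGA = 89, TPP = 35.5, FTM = 23, FTP = 100.0, OREB = 7,
         REB = 35, AST = 26, STL = 15, BLK = 3, TOV = 8, PF = 17)

# Predicted +/- = intercept + sum of (coefficient * value) over the predictors.
predicted_pm <- unname(coefs["(Intercept)"] + sum(coefs[names(sas)] * sas))
actual_pm <- -10  # Plus Minus recorded for this row in the sample data

round(c(predicted = predicted_pm, actual = actual_pm), 2)
```

For this row the equation gives a predicted +/- of roughly +8.4 against an actual value of -10. A miss of this size on a single game is not surprising, since the model only sees one team's box score and nothing about how the opponent played.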
The question I sought to answer is “How accurately can we predict wins and losses based on the box score variables?”
I chose the variables using stepAIC and other model selection tools, which led me to use points, 3 point makes, defensive rebounds, steals, blocks, and turnovers to predict whether a team won or lost.
Win or Loss ~ PTS + 3PM + DREB + STL + BLK + TOV
I used a randomized 80/20 split between training and testing data, giving 1979 training observations and 481 testing observations. The table on the right shows a small subset of the testing observations with the predicted and actual results, along with some game details and the variables I used to make my predictions. One thing to keep in mind is that each observation is only one team’s perspective; the opposing team’s perspective is recorded in a separate row. An example is rows 21 and 22, shown on the right, where the Detroit Pistons lost to the Philadelphia 76ers in Philadelphia on 4/10/22.
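The reported split sizes (1979 and 481) are close to, but not exactly, 80% and 20% of 2460 rows. One mechanism that produces sizes like this is assigning each row to training independently with probability 0.8. The sketch below illustrates that idea only; the actual splitting code is not shown in this section, so the method, the seed, and the variable names are assumptions.

```r
set.seed(2022)  # arbitrary seed, assumed here only for reproducibility
n <- 2460       # 1979 + 481 observations, as reported in the text

# Assign each row to training with probability 0.8, independently of the others.
is_train <- sample(c(TRUE, FALSE), size = n, replace = TRUE, prob = c(0.8, 0.2))

n_train <- sum(is_train)
n_test <- sum(!is_train)

# The realized share is near 0.8 but generally not exactly 0.8.
c(train = n_train, test = n_test, share = round(n_train / n, 3))
```

With a data frame such as `nba`, the two sets would then be `nba[is_train, ]` and `nba[!is_train, ]`.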
| Prediction (rows) / Reference (columns) | 0 | 1 |
|---|---|---|
| 0 | 175 | 62 |
| 1 | 49 | 195 |
As the confusion matrix above shows, the model is about 77% accurate ((175 + 195)/481) at predicting a win or a loss from these variables in out-of-sample testing. This is a strong result for the situation, since all of these variables depend heavily on the pace of the game. For example, it is hard to use only these variables to make predictions for both a slow-paced team and a fast-paced team.
| Team | Matchup | Game.Date | PTS | X3PM | DREB | STL | BLK | TOV | Win.or.Loss | Predicted.Result |
|---|---|---|---|---|---|---|---|---|---|---|
| PHX | PHX vs. SAC | 2022-04-10 | 109 | 14 | 32 | 9 | 7 | 11 | 0 | 1 |
| CLE | CLE vs. MIL | 2022-04-10 | 133 | 19 | 38 | 5 | 5 | 12 | 1 | 1 |
| ORL | ORL vs. MIA | 2022-04-10 | 125 | 23 | 42 | 4 | 3 | 10 | 1 | 1 |
| DET | DET @ PHI | 2022-04-10 | 106 | 11 | 27 | 4 | 4 | 20 | 0 | 0 |
| PHI | PHI vs. DET | 2022-04-10 | 118 | 5 | 32 | 13 | 6 | 11 | 1 | 1 |
| CHI | CHI @ MIN | 2022-04-10 | 124 | 10 | 32 | 9 | 3 | 23 | 1 | 1 |
| GSW | GSW @ NOP | 2022-04-10 | 128 | 19 | 34 | 5 | 2 | 17 | 1 | 1 |
| LAC | LAC vs. OKC | 2022-04-10 | 138 | 18 | 45 | 4 | 8 | 9 | 1 | 1 |
| NOP | NOP @ MEM | 2022-04-09 | 114 | 6 | 20 | 10 | 3 | 16 | 0 | 0 |
| BKN | BKN vs. CLE | 2022-04-08 | 118 | 12 | 32 | 5 | 8 | 11 | 1 | 1 |
| UTA | UTA vs. PHX | 2022-04-08 | 105 | 11 | 29 | 8 | 4 | 11 | 0 | 0 |
| MIL | MIL @ DET | 2022-04-08 | 131 | 11 | 41 | 7 | 1 | 8 | 1 | 1 |
| SAS | SAS @ MIN | 2022-04-07 | 121 | 10 | 37 | 5 | 5 | 12 | 0 | 1 |
| TOR | TOR vs. PHI | 2022-04-07 | 119 | 15 | 29 | 7 | 4 | 11 | 1 | 0 |
| LAC | LAC vs. PHX | 2022-04-06 | 113 | 12 | 46 | 7 | 9 | 17 | 1 | 1 |
| WAS | WAS @ ATL | 2022-04-06 | 103 | 10 | 38 | 4 | 4 | 14 | 0 | 0 |
| PHX | PHX @ LAC | 2022-04-06 | 109 | 17 | 36 | 11 | 2 | 12 | 0 | 1 |
| BKN | BKN vs. HOU | 2022-04-05 | 118 | 15 | 40 | 6 | 9 | 17 | 1 | 1 |
| CHA | CHA @ MIA | 2022-04-05 | 115 | 12 | 24 | 8 | 3 | 15 | 0 | 0 |
| CHI | CHI vs. MIL | 2022-04-05 | 106 | 9 | 29 | 8 | 2 | 15 | 0 | 0 |
For my natural cubic splines methodology, I am trying to find the degrees of freedom (DF) that minimize the in-sample sum of squares error (SSE). I will be using win or loss as the response variable with points as the predictor. After I find the best-fitting natural cubic spline in sample, I will then find the DF that gives the best out-of-sample SSE on the testing data set.
| dfs | SSE |
|---|---|
| 1 | 343.3190 |
| 2 | 342.8324 |
| 3 | 338.8375 |
| 4 | 338.8468 |
| 5 | 338.3369 |
| 6 | 338.2999 |
| 7 | 336.8774 |
| 8 | 336.5399 |
| 9 | 336.7557 |
As we can see from both the data frame of DFs vs SSE and the plot of DFs vs SSE, the optimal number of degrees of freedom is 8, with a sum of squares error of 336.5399. As we can see in the plot, increasing DF generally decreases SSE. The SSE appears to start to plateau, and running with a higher maximum DF would make that plateau visible in the graph. Now we will plot the fit corresponding to 8 degrees of freedom.
As we can see from the natural cubic splines graph, there is a clear relationship between points scored and win or loss. As expected, the more points a team scores, the better its chance of winning. The natural cubic spline has a much larger standard error towards the ends of the data, where the fit is not nearly as accurate. This is due to the low number of data points at the extremes of points scored, so the model cannot be as confident in its predictions there.
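The in-sample search over degrees of freedom described above can be sketched as a loop that fits one natural cubic spline per DF and records its SSE. This is a minimal illustration on simulated data, not the author's code: the toy data frame, the seed, and the choice of `lm()` with `splines::ns()` on a 0/1 response are assumptions.

```r
library(splines)

# Simulated stand-in data; these are NOT the real NBA box scores.
set.seed(7)
pts <- round(rnorm(600, mean = 110, sd = 12))
win <- rbinom(600, size = 1, prob = plogis(0.1 * (pts - 110)))
toy <- data.frame(PTS = pts, WL = win)

# Fit one natural cubic spline per DF and record the in-sample SSE.
dfs <- 1:9
sse <- sapply(dfs, function(d) {
  fit <- lm(WL ~ ns(PTS, df = d), data = toy)
  sum(residuals(fit)^2)
})

results <- data.frame(dfs = dfs, SSE = sse)
best_df <- results$dfs[which.min(results$SSE)]
results
```

Because the knots move as DF changes, the in-sample SSE is not guaranteed to fall at every step, which is consistent with the small increases seen in the table above (for example from 3 to 4 DF).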
| dfs | SSEt |
|---|---|
| 1 | 128.5914 |
| 2 | 127.8102 |
| 3 | 127.5632 |
| 4 | 127.5204 |
| 5 | 127.9212 |
| 6 | 128.3789 |
| 7 | 128.6045 |
| 8 | 128.6476 |
| 9 | 128.7866 |
Now I turn to the testing data set to see which natural cubic spline fit minimizes the out-of-sample sum of squares error. I am still using the same response, win or loss, and the same predictor, points. We can see from the graph of DF versus SSE that the optimal number of degrees of freedom is 4, which gives a sum of squares error of 127.5204. This is far lower than the 8 degrees of freedom we got from the training data set, but the testing data set is smaller too.
We can see that the knots are all close to the center, which means the different cubic polynomials are joined right around the middle, in the 98 - 115 point range. The fit appears to be linear in the middle and quadratic around the ends of the graph. This was surprising, as I assumed that the probability of winning would keep increasing as the number of points increases. We can see that around the ends the spline starts to flare out, as there is less data around these points.
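For the out-of-sample comparison, one common approach is to fit each spline on the training rows and score it on the held-out rows. The sketch below shows that approach on simulated data; whether the author fitted on the training or the testing set is not fully clear from the text, so the procedure, the data, and the names here are assumptions.

```r
library(splines)

# Simulated stand-in data; these are NOT the real NBA box scores.
set.seed(11)
n <- 800
toy <- data.frame(PTS = round(rnorm(n, mean = 110, sd = 12)))
toy$WL <- rbinom(n, size = 1, prob = plogis(0.1 * (toy$PTS - 110)))

# Hold out 20% of the rows for testing.
idx <- sample(n, round(0.8 * n))
train <- toy[idx, ]
test <- toy[-idx, ]

# Fit on the training rows, then measure SSE on the held-out rows.
sse_test <- sapply(1:9, function(d) {
  fit <- lm(WL ~ ns(PTS, df = d), data = train)
  sum((test$WL - predict(fit, newdata = test))^2)
})

best_df_test <- which.min(sse_test)
data.frame(dfs = 1:9, SSEt = sse_test)
```

Unlike the in-sample SSE, the held-out SSE can rise as DF grows, because extra flexibility starts fitting noise in the training rows.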
For K-Nearest Neighbors, I will be attempting to predict whether a team wins or loses based on the number of rebounds they secured, the number of points they scored, and their turnovers. I will convert win or loss into a factor so that KNN acts as a classifier predicting whether the team won or lost from points, rebounds, and turnovers. Then I will compare this to a prediction using all the variables.
Cell Contents
|-------------------------|
| N |
| N / Col Total |
|-------------------------|
Total Observations in Table: 738
| test_classes
knn_classes | 0 | 1 | Row Total |
-------------|-----------|-----------|-----------|
0 | 250 | 99 | 349 |
| 0.710 | 0.256 | |
-------------|-----------|-----------|-----------|
1 | 102 | 287 | 389 |
| 0.290 | 0.744 | |
-------------|-----------|-----------|-----------|
Column Total | 352 | 386 | 738 |
| 0.477 | 0.523 | |
-------------|-----------|-----------|-----------|
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 250 99
1 102 287
Accuracy : 0.7276
95% CI : (0.694, 0.7595)
No Information Rate : 0.523
P-Value [Acc > NIR] : <2e-16
Kappa : 0.4539
Mcnemar's Test P-Value : 0.8878
Sensitivity : 0.7102
Specificity : 0.7435
Pos Pred Value : 0.7163
Neg Pred Value : 0.7378
Prevalence : 0.4770
Detection Rate : 0.3388
Detection Prevalence : 0.4729
Balanced Accuracy : 0.7269
'Positive' Class : 0
Using rebounds, points, and turnovers as features for predicting whether the team won or lost the game, I got the cross table displayed above. The top left cell shows how accurate K-Nearest Neighbors was at predicting losses: it correctly predicted 250 of the 352 losses, so it was 71.0% accurate at identifying losses. The bottom right cell shows the number of wins KNN predicted correctly: 287 of 386, which comes out to approximately 74.4%. The overall accuracy is Sum(DIAG)/Sum(Everything), which equals (250 + 287)/(250 + 287 + 99 + 102) = 0.7276423, matching the value from the confusion matrix. Now we will try to predict the same thing using all of the variables in our data.
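A minimal sketch of the three-feature KNN classification described above, run on simulated data. It uses `class::knn` rather than the FNN or kknn packages loaded in the dashboard, and the value `k = 15`, the 70/30 split, and the feature scaling are assumptions, since the text does not state them.

```r
library(class)

# Simulated stand-in data; these are NOT the real NBA box scores.
set.seed(3)
n <- 600
feats <- data.frame(PTS = rnorm(n, 110, 12),
                    REB = rnorm(n, 44, 6),
                    TOV = rnorm(n, 14, 4))
logit <- 0.12 * (feats$PTS - 110) + 0.1 * (feats$REB - 44) - 0.1 * (feats$TOV - 14)
wl <- factor(rbinom(n, size = 1, prob = plogis(logit)), levels = c(0, 1))

# 70/30 split; scale with training means and SDs so no feature dominates distance.
idx <- sample(n, round(0.7 * n))
mu <- colMeans(feats[idx, ])
s <- apply(feats[idx, ], 2, sd)
train_x <- scale(feats[idx, ], center = mu, scale = s)
test_x <- scale(feats[-idx, ], center = mu, scale = s)

# Classify each test row by majority vote among its k nearest training rows.
knn_classes <- knn(train = train_x, test = test_x, cl = wl[idx], k = 15)

cm <- table(knn_classes, test_classes = wl[-idx])
accuracy <- sum(diag(cm)) / sum(cm)
cm
```

Scaling matters here because points are on a much larger numeric scale than turnovers; without it, distances would be driven almost entirely by points.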
Cell Contents
|-------------------------|
| N |
| N / Col Total |
|-------------------------|
Total Observations in Table: 738
| test_classes
knn_classes_all | 0 | 1 | Row Total |
----------------|-----------|-----------|-----------|
0 | 299 | 121 | 420 |
| 0.849 | 0.313 | |
----------------|-----------|-----------|-----------|
1 | 53 | 265 | 318 |
| 0.151 | 0.687 | |
----------------|-----------|-----------|-----------|
Column Total | 352 | 386 | 738 |
| 0.477 | 0.523 | |
----------------|-----------|-----------|-----------|
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 299 121
1 53 265
Accuracy : 0.7642
95% CI : (0.7319, 0.7944)
No Information Rate : 0.523
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5314
Mcnemar's Test P-Value : 3.789e-07
Sensitivity : 0.8494
Specificity : 0.6865
Pos Pred Value : 0.7119
Neg Pred Value : 0.8333
Prevalence : 0.4770
Detection Rate : 0.4051
Detection Prevalence : 0.5691
Balanced Accuracy : 0.7680
'Positive' Class : 0
When I ran K-Nearest Neighbors using all relevant variables, I got the cross table and the confusion matrix displayed above. On the 738 testing observations, the model correctly predicted 299 losses and 265 wins. To get the overall accuracy I can add these together and divide by the total: (299 + 265)/738 = 0.7642276, or 76.42%. This is the same accuracy value reported by confusionMatrix just below the cross table. There were a total of 352 losses and 386 wins in the testing data, so the model correctly predicted the losses 299/352 = 84.9% of the time and the wins 265/386 = 68.7% of the time. The error rate (1 - accuracy) was 0.2358, meaning 23.58% of the predictions were incorrect.
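The accuracy and per-class rates quoted for the two KNN models can be recomputed directly from the counts in the cross tables. The helper `class_rates` below is a hypothetical name introduced for illustration; the counts are copied from the tables above.

```r
# Rows are predictions (0, 1); columns are the true classes (0 = loss, 1 = win).
class_rates <- function(cm) {
  c(accuracy  = sum(diag(cm)) / sum(cm),      # correct predictions / all predictions
    loss_rate = cm[1, 1] / sum(cm[, 1]),      # losses correctly predicted
    win_rate  = cm[2, 2] / sum(cm[, 2]))      # wins correctly predicted
}

labels <- list(Prediction = c("0", "1"), Reference = c("0", "1"))
# matrix() fills column by column: first the true-loss column, then the true-win column.
knn_three <- matrix(c(250, 102, 99, 287), nrow = 2, dimnames = labels)
knn_all   <- matrix(c(299, 53, 121, 265), nrow = 2, dimnames = labels)

rates <- rbind(three_features = class_rates(knn_three),
               all_features   = class_rates(knn_all))
round(rates, 4)
```

Laid out side by side, the rates show the trade-off: using all variables raises overall accuracy and the share of losses caught, but lowers the share of wins caught.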
Part of the reason I chose this dataset was to see whether Naive Bayes classification could predict which team a box score belongs to based on its statistics. I attempted this, but Naive Bayes was not able to predict teams with any useful degree of accuracy. Because of this, I switched the focus to predicting wins and losses based on a subset of important statistics.
The question I sought to answer is “How accurately can we predict wins and losses based on the box score variables?” I am particularly interested in whether Naive Bayes is more accurate than Ridge Regression and K-Nearest Neighbors, which I used for similar predictions.
I chose the variables using stepAIC and other model selection tools, which led me to use points, 3 point makes, defensive rebounds, steals, blocks, and turnovers to predict whether a team won or lost.
Win or Loss ~ PTS + 3PM + DREB + STL + BLK + TOV
I used a randomized 80/20 split between training and testing data, giving 1979 training observations and 481 testing observations. The table on the right shows a small subset of the testing observations with the predicted and actual results, along with some game details and the variables used to make the predictions.
| Prediction (rows) / Reference (columns) | 0 | 1 |
|---|---|---|
| 0 | 175 | 67 |
| 1 | 49 | 190 |
As the confusion matrix above shows, my model is about 76% accurate ((175 + 190)/481) at predicting a win or a loss from the variables discussed in out-of-sample testing. The out-of-sample prediction accuracy using Ridge Regression was about 77%, so Naive Bayes was slightly worse in this case. About 76% is still a strong rate given the circumstances.
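For readers unfamiliar with how Naive Bayes reaches a win/loss prediction, here is a small Gaussian Naive Bayes written in base R and run on simulated data. It is a sketch of the general technique, not the author's code: the functions `nb_fit` and `nb_predict`, the toy data, and the Gaussian assumption for each feature are all introduced here for illustration (packages such as e1071 provide a ready-made version).

```r
# Simulated stand-in data; these are NOT the real NBA box scores.
set.seed(5)
n <- 500
X <- cbind(PTS = rnorm(n, 110, 12), DREB = rnorm(n, 34, 5), TOV = rnorm(n, 14, 4))
logit <- 0.15 * (X[, "PTS"] - 110) + 0.1 * (X[, "DREB"] - 34) - 0.1 * (X[, "TOV"] - 14)
y <- rbinom(n, size = 1, prob = plogis(logit))
idx <- sample(n, round(0.8 * n))  # 80/20 split

# Per class: prior probability plus a mean and SD for every feature.
nb_fit <- function(X, y) {
  lapply(split(as.data.frame(X), y), function(g) {
    list(prior = nrow(g) / nrow(X), mean = colMeans(g), sd = apply(g, 2, sd))
  })
}

# Score each class by log prior + sum of per-feature Gaussian log densities
# (features are treated as independent given the class: the "naive" part).
nb_predict <- function(model, X) {
  log_post <- sapply(model, function(m) {
    ll <- sapply(seq_len(ncol(X)), function(j) {
      dnorm(X[, j], mean = m$mean[j], sd = m$sd[j], log = TRUE)
    })
    log(m$prior) + rowSums(ll)
  })
  as.integer(colnames(log_post)[max.col(log_post)])
}

model <- nb_fit(X[idx, ], y[idx])
pred <- nb_predict(model, X[-idx, ])
accuracy <- mean(pred == y[-idx])
table(Prediction = pred, Reference = y[-idx])
```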
| Team | Matchup | Game.Date | PTS | X3PM | DREB | STL | BLK | TOV | Win.or.Loss | Predicted.Result |
|---|---|---|---|---|---|---|---|---|---|---|
| PHX | PHX vs. SAC | 2022-04-10 | 109 | 14 | 32 | 9 | 7 | 11 | 0 | 1 |
| CLE | CLE vs. MIL | 2022-04-10 | 133 | 19 | 38 | 5 | 5 | 12 | 1 | 1 |
| ORL | ORL vs. MIA | 2022-04-10 | 125 | 23 | 42 | 4 | 3 | 10 | 1 | 1 |
| DET | DET @ PHI | 2022-04-10 | 106 | 11 | 27 | 4 | 4 | 20 | 0 | 0 |
| PHI | PHI vs. DET | 2022-04-10 | 118 | 5 | 32 | 13 | 6 | 11 | 1 | 1 |
| CHI | CHI @ MIN | 2022-04-10 | 124 | 10 | 32 | 9 | 3 | 23 | 1 | 0 |
| GSW | GSW @ NOP | 2022-04-10 | 128 | 19 | 34 | 5 | 2 | 17 | 1 | 1 |
| LAC | LAC vs. OKC | 2022-04-10 | 138 | 18 | 45 | 4 | 8 | 9 | 1 | 1 |
| NOP | NOP @ MEM | 2022-04-09 | 114 | 6 | 20 | 10 | 3 | 16 | 0 | 0 |
| BKN | BKN vs. CLE | 2022-04-08 | 118 | 12 | 32 | 5 | 8 | 11 | 1 | 1 |
| UTA | UTA vs. PHX | 2022-04-08 | 105 | 11 | 29 | 8 | 4 | 11 | 0 | 0 |
| MIL | MIL @ DET | 2022-04-08 | 131 | 11 | 41 | 7 | 1 | 8 | 1 | 1 |
| SAS | SAS @ MIN | 2022-04-07 | 121 | 10 | 37 | 5 | 5 | 12 | 0 | 1 |
| TOR | TOR vs. PHI | 2022-04-07 | 119 | 15 | 29 | 7 | 4 | 11 | 1 | 1 |
| LAC | LAC vs. PHX | 2022-04-06 | 113 | 12 | 46 | 7 | 9 | 17 | 1 | 1 |
| WAS | WAS @ ATL | 2022-04-06 | 103 | 10 | 38 | 4 | 4 | 14 | 0 | 0 |
| PHX | PHX @ LAC | 2022-04-06 | 109 | 17 | 36 | 11 | 2 | 12 | 0 | 1 |
| BKN | BKN vs. HOU | 2022-04-05 | 118 | 15 | 40 | 6 | 9 | 17 | 1 | 1 |
| CHA | CHA @ MIA | 2022-04-05 | 115 | 12 | 24 | 8 | 3 | 15 | 0 | 0 |
| CHI | CHI vs. MIL | 2022-04-05 | 106 | 9 | 29 | 8 | 2 | 15 | 0 | 0 |
---
title: "Analyzing NBA Games Using ML Techniques"
author: "Adam White"
output:
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
    source_code: embed
    theme: united
rmarkdown: render("your_dashboard.Rmd",
  output_file = "dashboard.html",
  self_contained = TRUE)
---
```{r setup, include=FALSE}
library(flexdashboard)
library(kableExtra)
library(dplyr)
library(ggplot2)
library(teamcolors)
library(knitr)
library(olsrr)
library(leaps)
library(faraway)
library(GGally)
library(car)
library(readxl)
library(robustbase)
library(splines)
library(FNN)
library(gmodels)
library(caret)
library(kknn)
library(sjPlot)
library(sjlabelled)
library(sjmisc)
nba = read_xlsx("/Users/Adam/Downloads/4214_data.xlsx")
colnames(nba) <- c("Team", "Matchup", "Game Date", "Win or Loss", "MIN", "PTS", "FGM", "FGA", "FGP", "3PM", "3PA", "3PP", "FTM", "FTA", "FTP", "OREB", "DREB", "REB", "AST", "STL", "BLK", "TOV", "PF", "Plus Minus")
nba$Team <- factor(nba$Team, levels = unique(nba$Team))
```
Data Summary
=====================================
Column {data-width=450}
-----------------------------------------------------------------------
### Data Description {data-height=130}
The data set I worked with contains box scores from the 2021-2022 NBA season. The data was retrieved from NBA.com and has 24 columns and 2461 rows. Each row is one team's performance in a given game, so each game has two rows, one for each team.
https://docs.google.com/spreadsheets/d/1FcJVyAggt7qLm3J-gh1_cReJeQn_8jOQd77BaDY3crE/edit?usp=sharing
```{r data_dictionary}
# Data dictionary: one row per column of `nba` (name, type, description).
x <- data.frame(
  colnames(nba),
  c("String", "String", "String", "Boolean", rep("Double", 20)),
  c("Team whose data is shown", "Teams playing in the game", "Date of the game",
    "If the team won or lost", "Length of game play (minutes)", "Points scored",
    "Field goals made", "Field goals attempted", "Field goal percentage",
    "3 pointers made", "3 pointers attempted", "3 point percentage",
    "Free throws made", "Free throws attempted", "Free throw percentage",
    "Offensive rebounds", "Defensive rebounds", "Total rebounds", "Assists",
    "Steals", "Blocks", "Turnovers", "Personal fouls", "Team Plus-Minus")
)
names(x) <- c("Column", "Data Type", "Description")
kable(x)
```
Column {data-width=1550}
-----------------------------------------------------------------------
### Sample Data
```{r table_of_data}
kable(nba[1:13,]) %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
```
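Since every game appears as two rows, one per team, a quick consistency check is to group the rows into games and confirm that each game has exactly two rows whose Plus Minus values cancel out. The sketch below does this on four rows copied from the sample data; the `game_id` key built from the date and the sorted team codes is an assumption about how games could be identified, not something taken from the original code.

```r
# Four rows copied from the sample data (two games on 2022-04-10).
games <- data.frame(
  Team    = c("DEN", "LAL", "UTA", "POR"),
  Matchup = c("DEN vs. LAL", "LAL @ DEN", "UTA @ POR", "POR vs. UTA"),
  Date    = "2022-04-10",
  PM      = c(-5, 5, 31, -31)
)

# Pull the two three-letter team codes out of each matchup string.
teams <- regmatches(games$Matchup, gregexpr("[A-Z]{3}", games$Matchup))

# Same key for both rows of a game: date plus the team codes in sorted order.
games$game_id <- paste(games$Date,
                       sapply(teams, function(t) paste(sort(t), collapse = "-")))

rows_per_game <- table(games$game_id)                 # expect 2 rows per game
pm_sum <- tapply(games$PM, games$game_id, sum)        # expect 0 for every game
```

A game with a row count other than two, or a nonzero Plus Minus sum, would point to a missing or duplicated row.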
Column
-----------------------------------------------------------------------
### Team Wins
```{r barplot_team_wins}
nba$`Win or Loss` <- c("W" = 1, "L" = 0)[nba$`Win or Loss`]
tm_wins <- nba %>%
group_by(Team) %>%
summarize(Wins = sum(`Win or Loss`))
library(plotly)
ggplotly(ggplot(tm_wins, aes(Wins, Team, fill = Team)) +
geom_bar(stat="identity") +
scale_fill_manual(values = c("ATL" = "#e13a3e", "BKN" = "#061922", "BOS" = "#008348", "CHA" = "#006bb6", "CHI" = "#ce1141", "CLE" = "#860038", "DAL" = "#007dc5", "DEN" = "#4d90cd", "DET" = "#ed174c", "GSW" = "#fdb927", "HOU" = "#ce1141", "IND" = "#ffc633", "LAC" = "#ed174c", "LAL" = "#fdb927", "MEM" = "#0f586c", "MIA" = "#98002e", "MIL" = "#00471b", "MIN" = "#005083", "NOP" = "#002b5c", "NYK" = "#006bb6", "OKC" = "#007dc3", "ORL" = "#007dc5", "PHI" = "#ed174c", "PHX" = "#e56020", "POR" = "#e03a3e", "SAC" = "#724c9f", "SAS" = "#bac3c9", "TOR" = "#ce1141", "UTA" = "#002b5c", "WAS" = "#002b5c")
)+
theme_bw() +
theme(legend.position = "none"))
```
Logistic Regression
=====================================
Column {data-width=250}
-----------------------------------------------------------------------
### Logistic Regression
I wanted to answer two questions: "Which variables have the most impact on a team's probability of winning?" and "At what values of those variables would we predict a win rather than a loss?"
Logistic regression is well suited to predicting between two states, so I used it to predict whether a team won or lost a game. I used the same variables as in the Naive Bayes model (points, 3 point makes, defensive rebounds, steals, blocks, and turnovers) to predict wins and losses.
Shown in the plots on the right are logistic regression fits of wins against each of the four most significant variables (points, defensive rebounds, steals, and turnovers). The strongest predictors are points and defensive rebounds, as shown by the steep curves in their plots. At and above 110 points scored, teams are more likely to win than lose. Above 34 defensive rebounds, the predicted win probability is above 0.5. The slope of the logistic fit for steals is far more constant (closer to a straight line) than that of points and defensive rebounds, because steals have less impact on the outcome of the game than those two variables. Turnovers are similar to steals in that they have a smaller impact on the outcome than points or defensive rebounds; however, turnovers are the only fit with a negative slope. All four plots behave as expected and show which variables matter most to winning and what values teams should aim for.
Row
-----------------------------------------------------------------------
### Points
```{r Logistic_Regression}
# `Win or Loss` is already coded 0/1 in the Team Wins chunk; unname() only drops
# the leftover "W"/"L" names. Explicit levels avoid depending on row order.
nba$`Win or Loss` <- unname(nba$`Win or Loss`)
nba$WL <- factor(nba$`Win or Loss`, levels = c(0, 1))
modelfit <- glm(`Win or Loss` ~ PTS + `3PM` + DREB + STL + BLK + TOV, data = nba, family = binomial)
ggplotly(ggplot(nba, aes(x = PTS, y = `Win or Loss`)) +
geom_point(aes(color = WL), position = position_jitter(height = 0.03, width = 0)) +
geom_smooth(method = "glm", method.args = list(family="binomial")) +
scale_color_manual(name = "Win or Loss", values = c("#861F41", "#E87722")) +
labs(title = "Logistic Regression Fit to Wins by Points",
x = "Points",
y = "P(Win)") +
theme_bw())
```
### Defensive Rebounds
```{r lo}
ggplotly(ggplot(nba, aes(x = DREB, y = `Win or Loss`)) +
geom_point(aes(color = WL), position = position_jitter(height = 0.03, width = 0)) +
geom_smooth(method = "glm", method.args = list(family="binomial")) +
scale_color_manual(name = "Win or Loss", values = c("#861F41", "#E87722")) +
labs(title = "Logistic Regression Fit to Wins by Defensive Rebounds",
x = "Defensive Rebounds",
y = "P(Win)") +
theme_bw())
```
Row
-----------------------------------------------------------------------
### Steals
```{r log_reg_2}
ggplotly(ggplot(nba, aes(x = STL, y = `Win or Loss`)) +
geom_point(aes(color = WL), position = position_jitter(height = 0.03, width = 0)) +
geom_smooth(method = "glm", method.args = list(family="binomial")) +
scale_color_manual(name = "Win or Loss", values = c("#861F41", "#E87722")) +
labs(title = "Logistic Regression Fit to Wins by Steals",
x = "Steals",
y = "P(Win)") +
theme_bw())
```
### Turnovers
```{r l}
ggplotly(ggplot(nba, aes(x = TOV, y = `Win or Loss`)) +
geom_point(aes(color = WL), position = position_jitter(height = 0.03, width = 0)) +
geom_smooth(method = "glm", method.args = list(family="binomial")) +
scale_color_manual(name = "Win or Loss", values = c("#861F41", "#E87722")) +
labs(title = "Logistic Regression Fit to Wins by Turnovers",
x = "Turnovers",
y = "P(Win)") +
theme_bw())
```
Multiple Linear Regression
=====================================
Column {data-width=1000}
-----------------------------------------------------------------------
### Model Selection
For multiple linear regression, the question I was trying to answer is: what is the best model for predicting a team's +/- (Plus/Minus) from its other box score stats? First I will use model selection to find the most precise model, and then I will analyze the results of that model and assess its accuracy when predicting the plus/minus of the game.
```{r}
nba_dat = read_xlsx("/Users/Adam/Downloads/4214_data.xlsx")
names(nba_dat)[4] = "WL"
nba_dat$WL = c("W" = 1, "L" = 0)[nba_dat$WL]
nba_clean = nba_dat[,4:23]
nba_clean = nba_clean[,-2]
nba_clean = nba_clean[,-13] # removing DREB since it is highly correlated w REB
nba_clean = nba_clean[,-10] # removing FTA since correlation w FTM
nba_clean = nba_clean[-1797,]
names(nba_clean)[10] = "FT_perc"
names(nba_clean)[6] = "Three_PM"
names(nba_clean)[7] = "Three_PA"
names(nba_clean)[1] = "WL"
names(nba_clean)[5] = "FG_perc"
names(nba_clean)[8] = "Three_perc"
```
```{r}
library(readxl)
library(car)
library(bestglm)
nba2 = read_xlsx("/Users/Adam/Downloads/4214_data.xlsx")
df = cbind(pts = nba2$PTS, FGM = nba2$FGM, FGA = nba2$FGA, FGP = nba2$`FG%`,
TPM = nba2$`3PM`, TPA = nba2$`3PA`, TPP = nba2$`3P%`, FTM = nba2$FTM,
FTA=nba2$FTA, FTP=nba2$`FT%`, OREB=nba2$OREB, DREB=nba2$DREB, REB=nba2$REB,
AST=nba2$AST, STL=nba2$STL, BLK=nba2$BLK, TOV=nba2$TOV, PF=nba2$PF,
PM=nba2$`+/-`)
df = as.data.frame(df)
# Drop FGM, FTA, and DREB by name (removed because of multicollinearity).
df = df[, !(names(df) %in% c("FGM", "FTA", "DREB"))]
bestglm(df, IC = "AIC", method = "exhaustive")
best_mod = lm(`+/-` ~ PTS + FGA + `3P%` + FTM + `FT%` + OREB + REB + AST + STL +
BLK + TOV + PF, data = nba2)
```
Once again, I am attempting to use these NBA box score statistics to accurately predict the +/- of the game. The +/- (plus minus) is an integer equal to the total points a team scored minus the total points its opponent scored. It is negative if the team lost and positive if the team won. I had to remove FGM, FTA, and DREB because of multicollinearity between these and other variables. I used a best subsets approach for variable selection. Using AIC as my information criterion, I was left with a model that included more variables than I expected: +/- ~ PTS + FGA + 3P% + FTM + FT% + OREB + REB + AST + STL + BLK + TOV + PF. All of the variables in this model were statistically significant at the 0.05 level. The model with coefficients (rounded to four decimal places) is +/- = -42.0838 + 0.7860 * PTS - 1.3189 * FGA + 0.0890 * 3P% - 0.8629 * FTM + 0.1916 * FT% - 0.1314 * OREB + 1.5264 * REB + 0.0924 * AST + 1.4931 * STL + 0.3754 * BLK - 1.1974 * TOV + 0.1303 * PF. One interesting thing I noticed was that rebounds had the largest coefficient, which surprised me. I expected it to be points, but each coefficient is the change in +/- per one-unit increase, and since a team records far more points than rebounds in a game, a single rebound can carry a larger estimated coefficient than a single point. The adjusted R-squared of this model was 0.7313, meaning that 73.13% of the variability in plus minus is explained by this model. Now we can continue on to checking the assumptions for multiple linear regression. The coefficients are shown in the coefficient plot on the right hand side, along with a confidence interval for each variable's coefficient.
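The multicollinearity that led to dropping FGM, FTA, and DREB is usually diagnosed with variance inflation factors (VIF). As a rough illustration of what a VIF measures, the sketch below computes it by hand on simulated data in which free throws made depend on free throws attempted. The toy data and its parameters are assumptions; on the real data, `car::vif()` applied to a fitted `lm` gives the same quantity.

```r
# Simulated stand-in data; these are NOT the real NBA box scores.
set.seed(9)
n <- 300
FTA <- rpois(n, 22)
toy <- data.frame(
  FTA = FTA,
  FTM = rbinom(n, size = FTA, prob = 0.77),  # makes depend on attempts
  AST = rpois(n, 25)                         # unrelated to free throws
)

# VIF for a predictor = 1 / (1 - R^2), where R^2 comes from regressing that
# predictor on all of the other predictors.
vif <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})
round(vif, 2)
```

In this toy setup FTA and FTM inflate each other's variance while AST stays near 1, which mirrors the reasoning for keeping only one variable from each strongly related pair.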
Row
-----------------------------------------------------------------------
### Coefficients Plot
```{r fig.width = 9, fig.height= 7}
plot_model(best_mod, title = "Coefficients Value and CI of Model")
```
We can see that rebounds and steals have the two largest positive coefficients when predicting a team's +/-. Intuitively this makes sense: when you rebound the ball you secure a possession, whether the opposing team missed the shot or your team did, and a steal likewise secures a possession for your team. One thing I found very surprising was that field goals attempted (FGA) has the most negative estimated coefficient, at -1.32. One explanation is that the model is accounting for teams that are just chucking up shots (volume over quality): every attempt carries a penalty, but points has a positive coefficient, so a made shot counteracts that penalty. Finally, the second most negative coefficient belongs to turnovers (TOV), which are the opposite of steals: a turnover guarantees that you lost a possession.
Ridge Regression
=====================================
Column {data-width=500}
-----------------------------------------------------------------------
### Ridge Regression
The question I sought to answer is "How accurately can we predict wins and losses based on the box score variables?".
I determined which variables to use with stepAIC and other model selection tools, which led me to use points, 3-point makes, defensive rebounds, steals, blocks, and turnovers to predict whether a team won or lost.
Win or Loss ~ PTS + 3PM + DREB + STL + BLK + TOV
I used a randomized 80/20 split between training and testing data, giving 1979 training observations and 481 testing observations. This split can be seen in the table on the right, which shows a small subset of the testing observations with the prediction and the actual result. Also shown are some game details and the variables I used to make my predictions. One thing to keep in mind is that each observation is only one team's perspective, and the opposing team's perspective is recorded in a separate row. An example of this is rows 21 and 22, shown on the right, where the Detroit Pistons lost to the Philadelphia 76ers in Philadelphia on 4/10/22.
### Confusion Matrix
```{r ridge_regression}
library(MASS)
library(glmnet)
library(dplyr)
library(caret) # createDataPartition() and confusionMatrix()
# Fit the penalized model. It is fit on every row of nba, before the
# train/test split below, and glmnet() defaults to alpha = 1 (lasso penalty);
# alpha = 0 would request a ridge penalty.
model <- glmnet(model.matrix(`Win or Loss` ~ PTS + `3PM` + DREB + STL + BLK + TOV, data = nba), data.matrix(nba$`Win or Loss`))
set.seed(1)
trainIndex <- createDataPartition(nba$Team, p = 0.8, list = FALSE)
nba$`Win or Loss` <- factor(nba$`Win or Loss`, levels = unique(nba$`Win or Loss`))
train <- nba[trainIndex, ]
test <- nba[-trainIndex, ]
# Test matrix with 7 columns; the first one sits where model.matrix() put the intercept column
x <- test[, c("Win or Loss", "PTS", "3PM", "DREB", "STL", "BLK", "TOV")] %>% data.matrix()
preds <- predict(model, newx = x, type = "response")
# Classify at a 0.5 cutoff using the 61st lambda on the regularization path
pred_categories <- ifelse(preds > 0.5, 1, 0)[,61]
uh <- data.frame(test[,c(1,2,3,6,10,17,20,21,22)], "Actual Result" = test[,4], "Predicted Result" = pred_categories)
uh$Predicted.Result <- as.factor(uh$Predicted.Result)
as.table(confusionMatrix(uh$Predicted.Result, test$`Win or Loss`))
```
As you can see with the confusion matrix above, our model is about 77% accurate at predicting a win or a loss from the variables discussed with out-of-sample testing. This is an amazing accuracy for the situation since all of these variables are highly dependent on the pace of the game. For example, it is hard to use only these variables to make predictions for both a slow-paced team and a fast-paced team.
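For comparison, here is a minimal sketch of how an explicitly ridge-penalized logistic model could be fit on training rows only and scored on held-out rows. It is not the code that produced the results above: the data are simulated, the names `toy`, `train_id`, and `cv_fit` are hypothetical, and it assumes the glmnet package is installed. Setting `alpha = 0` selects the ridge penalty, and `cv.glmnet()` chooses the penalty strength by cross-validation.

```r
library(glmnet)

set.seed(1)
n <- 400
# Simulated stand-in for box score rows (hypothetical data, not the NBA set)
toy <- data.frame(PTS = rnorm(n, 110, 12), TOV = rnorm(n, 14, 4))
toy$WL <- rbinom(n, 1, plogis(0.08 * (toy$PTS - 110) - 0.10 * (toy$TOV - 14)))

# 80/20 split; the model only ever sees the training rows
train_id <- sample(n, 0.8 * n)
x_train <- as.matrix(toy[train_id, c("PTS", "TOV")])
x_test  <- as.matrix(toy[-train_id, c("PTS", "TOV")])

# alpha = 0 gives the ridge penalty; family = "binomial" gives logistic regression
cv_fit <- cv.glmnet(x_train, toy$WL[train_id], family = "binomial", alpha = 0)

# Predicted win probabilities at the cross-validated lambda, cut at 0.5
prob <- predict(cv_fit, newx = x_test, s = "lambda.min", type = "response")
pred <- as.integer(prob > 0.5)
acc  <- mean(pred == toy$WL[-train_id])
acc
```

Fitting on the training rows only keeps the reported accuracy a true out-of-sample estimate.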
Column
-----------------------------------------------------------------------
### Predictions
```{r Predictions_ridge}
kable(uh[1:20,], align = "r") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
```
Natural Cubic Splines
=====================================
Column
-----------------------------------------------------------------------
For my natural cubic splines methodology, I am trying to find the degrees of freedom (DF) that minimize the in-sample sum of squared errors (SSE). I will be using win or loss as the response variable with points as the predictor. After I find the best-fitting natural cubic spline on the training data, I will then look for the best natural cubic spline based on the out-of-sample SSE, computed on the testing data set.
```{r include = FALSE}
SSE = rep(0, 9) # one slot per candidate df (1 through 9)
# Splitting training and testing data
nba_clean$WL = as.numeric(nba_clean$WL)
nba_train = nba_clean[1:1800,]
nba_test = nba_clean[1800:2460,]
for (i in 1:9){
ns = lm(WL ~ ns(PTS, df = i), data = nba_train)
pred.ns <- predict(ns, newdata = data.frame(PTS = nba_train$PTS), se = T)
SSE[i] = sum((pred.ns$fit - nba_train$WL)**2)
}
dfs = (1:9)
dfSSE = cbind(dfs, SSE)
mat = data.frame(dfs, SSE)
mat
```
```{r}
print(mat)
#ggplot(data = mat, aes(x = dfs, y = SSE)) + geom_point() +
#ggtitle("SSE of Sample1 Training versus Degrees of Freedom")
```
As we can see from the table of DFs versus SSE, the optimal number of degrees of freedom is 8, with a sum of squared errors of 336.5399. Increasing DF generally decreases the in-sample SSE, although it appears to start to plateau; running with a higher maximum DF would make that plateau clearer. Now we will plot the fit corresponding to 8 degrees of freedom.
```{r, fig.height= 4, fig.width=6, echo = FALSE}
best.ns = lm(WL ~ ns(PTS, df = 8), data = nba_train)
pred.ns <- predict(best.ns, newdata = data.frame(PTS = nba_train$PTS),
se = T)
ggplot(nba_train, aes(x = PTS, y = WL))+geom_point(pch = 1, color = "gray") +
geom_line(data = data.frame(PTS = nba_train$PTS, sp = pred.ns$fit),
aes(x = PTS, y = sp), color = "red") +
ggtitle("Natural Cubic Splines of NBA Training data with df = 8") +
geom_line(data = data.frame(PTS = nba_train$PTS, up = pred.ns$fit +
2*pred.ns$se.fit), aes(x = PTS, y = up),
linetype = 2) +
geom_line(data = data.frame(PTS = nba_train$PTS, up = pred.ns$fit -
2*pred.ns$se.fit), aes(x = PTS, y = up),
linetype = 2) +
geom_vline(xintercept = attributes(ns(nba_train$PTS, df = 8))$knots, linetype
= "dashed", color = "grey30")
```
As we can see from the natural cubic splines graph, there is a relationship between points scored and winning or losing. As expected, the more points a team scores, the better its chance of winning. The natural cubic spline has a much larger standard error towards the ends of the data, where the fit is not nearly as precise. This is due to the low number of data points at the extremes of points scored, so the model cannot be as confident in its predictions there.
Column
-----------------------------------------------------------------------
```{r, fig.height= 4, fig.width=6, echo = FALSE}
SSEt = rep(0, 9) # one slot per candidate df (1 through 9)
for (i in 1:9){
ns = lm(WL ~ ns(PTS, df = i), data = nba_train)
pred.ns <- predict(ns, newdata = data.frame(PTS = nba_test$PTS), se = T)
SSEt[i] = sum((pred.ns$fit[1:660] - nba_test$WL[1:660])**2)
}
mat1t = data.frame(dfs, SSEt)
print(mat1t)
#ggplot(data = mat1t, aes(x = dfs, y = SSEt)) + geom_point() +
#ggtitle("SSE of Sample1 Training versus Degrees of Freedom")
```
Now I am evaluating the natural cubic splines on the testing data set, to see which degrees of freedom minimize the out-of-sample sum of squared errors for splines fitted on the training data set. I am still using the same response, win or loss, and the same predictor, points. We can see from the table of DF versus SSE that the optimal number of degrees of freedom is 4, which gives a sum of squared errors of 127.6512. This is far lower than the 8 degrees of freedom we got from the training data set, but the testing data set is smaller too. In-sample SSE also tends to keep improving as the spline becomes more flexible, whereas out-of-sample SSE penalizes overfitting.
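The selection procedure described above can be sketched in a few lines: fit one natural cubic spline per candidate DF on the training rows, then score each fit on the held-out rows. This is an illustration on simulated data rather than the NBA data, and the names `toy`, `sse_test`, and `best_df` are hypothetical; it uses only the `splines` package that ships with R.

```r
library(splines)

set.seed(1)
n <- 600
# Simulated points and win/loss outcomes (hypothetical stand-in data)
toy <- data.frame(PTS = rnorm(n, 110, 12))
toy$WL <- rbinom(n, 1, plogis(0.08 * (toy$PTS - 110)))

# Non-overlapping training and testing rows
toy_train <- toy[1:400, ]
toy_test  <- toy[401:600, ]

dfs <- 1:9
# For each df: fit on the training rows, compute SSE on the testing rows
sse_test <- sapply(dfs, function(d) {
  fit <- lm(WL ~ ns(PTS, df = d), data = toy_train)
  sum((predict(fit, newdata = toy_test) - toy_test$WL)^2)
})

# Degrees of freedom with the smallest out-of-sample SSE
best_df <- dfs[which.min(sse_test)]
data.frame(dfs, sse_test)
```

Using `sapply()` over the candidate values returns a vector whose length always matches `dfs`, which avoids having to preallocate the SSE vector by hand.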
```{r, fig.height= 4, fig.width=6, echo = FALSE}
best.ns.t = lm(WL ~ ns(PTS, df = 4), data = nba_test)
pred.ns.t <- predict(best.ns.t, newdata = data.frame(PTS = nba_test$PTS),
se = T)
ggplot(nba_test, aes(x = PTS, y = WL))+geom_point(pch = 1, color = "gray") +
geom_line(data = data.frame(PTS = nba_test$PTS, sp = pred.ns.t$fit),
aes(x = PTS, y = sp), color = "red") +
ggtitle("Natural Cubic Splines of NBA Testing data with df = 4") +
geom_line(data = data.frame(PTS = nba_test$PTS, up = pred.ns.t$fit +
2*pred.ns.t$se.fit), aes(x = PTS, y = up),
linetype = 2) +
geom_line(data = data.frame(PTS = nba_test$PTS, up = pred.ns.t$fit -
2*pred.ns.t$se.fit), aes(x = PTS, y = up),
linetype = 2) +
geom_vline(xintercept = attributes(ns(nba_test$PTS, df = 4))$knots, linetype
= "dashed", color = "grey30")
```
We can see that the knots are all close to the center, which means the different cubic polynomials are joined right around the middle, in the 98 to 115 point range. The fit appears to be linear in the middle and quadratic around the ends of the graph. This was surprising, as I assumed that the probability of winning would keep increasing as the number of points increases. We can see that around the ends the confidence band around the spline starts to flare out, as there is less data around these points.
K-Nearest Neighbors Classification
=====================================
Column
-----------------------------------------------------------------------
For K-Nearest Neighbors, I will be attempting to predict whether a team wins or loses based on the number of rebounds it secured, the number of points it scored, and its turnovers. I will convert win or loss into a factor so that KNN performs classification on points, rebounds, and turnovers. Then I will compare this to a prediction using all of the variables.
```{r, echo = FALSE}
set.seed(0)
nba_clean$WL = as.factor(nba_clean$WL)
index <- sample(1:nrow(nba_clean), round(nrow(nba_clean) * 0.7))
training_df <- nba_clean[index, ]
testing_df <- nba_clean[-index, ]
train_classes <- training_df$WL
test_classes <- testing_df$WL
train_features <- data.frame(cbind(training_df[2], training_df[12], training_df[16]))
test_features <- data.frame(cbind(testing_df[2], testing_df[12], testing_df[16]))
knn_classes <- knn(train = train_features, test = test_features,
cl = train_classes, k = 5)
CrossTable(x = knn_classes, y = test_classes, prop.chisq = FALSE,
prop.t = F, prop.r = F)
confusionMatrix(knn_classes, test_classes)
```
When using rebounds, points, and turnovers as features for predicting whether the team won or lost the game, I got the cross table displayed. The top left corner shows how accurate K-Nearest Neighbors was at predicting losses: it correctly predicted 250 of the 352 losses, or about 71.0%. The bottom right corner shows how often KNN correctly predicted wins: 287 of the 386 wins, or about 74.4%. The overall accuracy is Sum(DIAG)/Sum(Everything), which equals (250 + 287)/(250 + 287 + 99 + 102) = 0.7276423, matching the value reported by the confusion matrix. Next, we will try to predict the same thing using all of the variables in our data.
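Because KNN classifies by distance, features measured on larger scales (such as points) can dominate features on smaller scales (such as turnovers). The sketch below shows one common way to handle this, standardizing the features with the training set's means and standard deviations before calling `knn()`. It is an optional variation on simulated data, not the code used for the results above; the names `toy`, `train_x`, and `test_x` are hypothetical, and it uses the `class` package that ships with R.

```r
library(class)

set.seed(0)
n <- 300
# Simulated points, rebounds, and turnovers (hypothetical stand-in data)
toy <- data.frame(PTS = rnorm(n, 110, 12),
                  REB = rnorm(n, 44, 6),
                  TOV = rnorm(n, 14, 4))
score <- toy$PTS + 0.5 * toy$REB - toy$TOV + rnorm(n, 0, 10)
toy$WL <- factor(ifelse(score > 118, "W", "L"))

# 70/30 split, matching the proportion used in this section
idx <- sample(n, round(0.7 * n))

# Standardize with training statistics only, then apply them to the test rows
train_x <- scale(toy[idx, c("PTS", "REB", "TOV")])
test_x  <- scale(toy[-idx, c("PTS", "REB", "TOV")],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

pred <- knn(train = train_x, test = test_x, cl = toy$WL[idx], k = 5)
acc  <- mean(pred == toy$WL[-idx])
acc
```

Reusing the training means and standard deviations for the test rows keeps information from the test set out of the preprocessing step.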
Column
-----------------------------------------------------------------------
### Using all variables
```{r, echo = FALSE}
set.seed(0)
train_features_all <- data.frame(training_df[2:17])
test_features_all <- data.frame(testing_df[2:17])
knn_classes_all <- knn(train = train_features_all, test = test_features_all,
cl = train_classes, k = 10)
CrossTable(x = knn_classes_all, y = test_classes, prop.chisq = FALSE,
prop.t = F, prop.r = F)
confusionMatrix(knn_classes_all, test_classes)
```
When I ran K-Nearest Neighbors using all relevant variables, I got the cross table and the confusion matrix displayed above. On the 738 testing observations, the model correctly predicted 299 losses and 265 wins. To get the overall accuracy I can add these together and divide by the total: (299 + 265)/738 = 76.42%. This is the same accuracy value we got from confusionMatrix just below the cross table. There were a total of 352 losses and 386 wins in the testing data, which means that we correctly predicted the losses 299/352 = 84.9% of the time and the wins 265/386 = 68.7% of the time. The error rate, (1 - accuracy), was 0.2358, so 23.58% of the predictions were incorrect.
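The arithmetic above can be reproduced directly from the four cells of the cross table. In this sketch the diagonal counts (299 and 265) and the class totals (352 losses, 386 wins) come from the text; the off-diagonal counts are derived from them as 352 - 299 = 53 and 386 - 265 = 121, and the object name `cm` is only illustrative.

```r
# Rows are predicted classes, columns are actual classes (filled column by column)
cm <- matrix(c(299, 53,    # actual losses: predicted L, predicted W
               121, 265),  # actual wins:   predicted L, predicted W
             nrow = 2,
             dimnames = list(Predicted = c("L", "W"), Actual = c("L", "W")))

# Overall accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Share of actual losses and actual wins that were predicted correctly
per_class <- diag(cm) / colSums(cm)

round(accuracy, 4)      # 0.7642
round(per_class, 3)     # L: 0.849, W: 0.687
round(1 - accuracy, 4)  # 0.2358
```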
Naive Bayes
=====================================
Column {data-width=500}
-----------------------------------------------------------------------
### Naive Bayes
Part of the reason I chose this dataset was to see if we could use Naive Bayes classification to predict a team based on its statistics. I attempted to do so, but Naive Bayes was not able to predict teams with any degree of accuracy. Because of this, I switched the area of focus to predicting wins and losses based on a subset of important statistics.
The question I sought to answer is "How accurately can we predict wins and losses based on the box score variables?". I am particularly interested in whether Naive Bayes is more accurate than Ridge Regression and K-Nearest Neighbors, which I used for similar predictions.
I determined which variables to use with stepAIC and other model selection tools, which led me to use points, 3-point makes, defensive rebounds, steals, blocks, and turnovers to predict whether a team won or lost.
Win or Loss ~ PTS + 3PM + DREB + STL + BLK + TOV
I used a randomized 80/20 split between training and testing data giving 1979 training observations and 481 testing observations. This split can be seen in the table on the right where we have a small subset of the testing observations with the prediction and actual result. Also shown are some game details and the variables we used to make our prediction.
### Confusion Matrix
```{r Naive_Bayes_Confusion}
library(e1071)
library(caret)
# The model is fit on every row of nba, before the train/test split below
model <- naiveBayes(`Win or Loss` ~ PTS + `3PM` + DREB + STL + BLK + TOV, data = nba)
set.seed(1)
trainIndex <- createDataPartition(nba$Team, p = 0.8, list = FALSE)
nba$`Win or Loss` <- factor(nba$`Win or Loss`, levels = unique(nba$`Win or Loss`))
train <- nba[trainIndex, ]
test <- nba[-trainIndex, ]
preds <- predict(model, newdata = test)
uh <- data.frame(test[,c(1,2,3,6,10,17,20,21,22)], "Actual Result" = test[,4], "Predicted Result" = preds)
as.table(confusionMatrix(preds, test$`Win or Loss`))
```
As you can see with the confusion matrix above, my model is about 75% accurate at predicting a win or a loss from the variables discussed with out-of-sample testing. The out-of-sample prediction accuracy using Ridge Regression was about 77%, so Naive Bayes was slightly worse in this case. 75% is still a fantastic rate given the circumstances.
Column
-----------------------------------------------------------------------
### Predictions
```{r Predictions}
kable(uh[1:20,], align = "r") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
```